Abstract

We show tight bounds for online Hamming distance computation in the cell-probe model
with word size w. In this problem, we are given a fixed string F of length n and we consider
a stream in which symbols arrive one at a time. The task is to output the Hamming distance
between F and the string that consists of the last n symbols of the stream. We give a lower
bound of $\Omega(\frac{\delta}{w} \log n)$ time on average per output for this problem, where $\delta$ is the number of
bits needed to represent an input symbol. We argue that this bound is in fact tight within
the model using an existing reduction from online to offline pattern matching problems. Our
lower bounds hold both under randomisation and amortisation.

1 Introduction 

We consider the complexity of computing the Hamming distance. The question of how to 
compute the Hamming distance efficiently has a rich literature, spanning many of the most 
important fields in computer science. Within the theory community, communication complexity 
based lower bounds and streaming model upper bounds for the Hamming distance problem have
been the subject of particularly intense study [CDIM03, Woo04, HSZZ06, JKS08, BCRVW10,
CR11]. This previous work has however almost exclusively focussed on providing resource bounds
(either in terms of space or bits of communication) for computing approximate answers. 

We give the first time complexity lower bounds for exact Hamming distance computation in 
an online or streaming context. Our results are in the cell-probe model where we also provide 
matching upper bounds. 

Problem (Online Hamming distance). For a fixed string F of length n, we consider a stream 
in which symbols arrive one at a time. For each arriving symbol, before the next symbol arrives, 
we output the Hamming distance between F and the last n symbols of the stream. 

We show that there are instances of this problem for which any algorithm solving it will
require $\Omega(\frac{\delta}{w} \log n)$ time on average per output, where $\delta$ is the number of bits needed to represent
an input symbol and w is the number of bits per cell in the cell-probe model. Lower bounds
in the cell-probe model also hold for the popular word-RAM model in which many of today's
algorithms are given. The full statement of the main result of this paper is given in Theorem 1.

Theorem 1. In the cell-probe model with w bits per cell there exist instances of the online Hamming
distance problem such that the expected amortised time per arriving symbol is $\Omega(\frac{\delta}{w} \log n)$.

When $\delta = w$, for example, we have an $\Omega(\log n)$ lower bound. Despite the relatively modest
appearance of this bound, we prove that it is in fact tight within the cell-probe model. Furthermore,
we argue that it is likely to be the best bound available for this problem without a
significant breakthrough in computational complexity.

The first cell-probe lower bounds in this online model were given recently for the problems of
online convolution and multiplication [CJ11]. This work introduced the use of the information
transfer method [PD04] from the world of data structure lower bounds to this class of online or
streaming problems. Information transfer captures the amount of information that is transferred 
from the operations in one time interval to the next and hence the minimum number of cells 
that must be read to compute a new set of answers. 

Our key innovation is the use of a random distribution over a subset of possible input 
streams, which together with a carefully designed fixed string F provide a lower bound for the 
information transfer between successive time intervals. The use of such a purposefully designed
input departs from the most closely related previous work [CJ11] and also from much of the lower
bound literature where simple uniform distributions over the whole input space often suffice. It 
also however necessarily creates a number of challenging technical hurdles which we overcome. 

The central fact that had previously enabled a lower bound to be proven for the online 
convolution problem was that the inner product between a vector and successive suffixes of the 
stream reveals a lot of information about the history of the stream. Establishing a similar result 
for the online Hamming distance problem appears, however, to be considerably more challenging for
a number of reasons. The first and most obvious is that the amount of information one gains
by comparing whether two potentially large symbols are equal is at most one bit, as opposed to
$O(\log n)$ bits for multiplication. The second is that a particularly simple worst case string could
be found for the convolution problem which greatly eased the resulting analysis. We have not 
been able to find such a simple fixed string for the Hamming distance problem and our proof 
of the existence of a hard instance is non-constructive and involves a number of new insights, 
combining ideas from coding theory and additive combinatorics. 

The bounds we give are also tight within the cell-probe model. This can be seen by application
of an existing general reduction from online to offline pattern matching problems [CEPP11].
In this previous work it was shown that any offline algorithm for Hamming distance computation
can be converted to an online one with at most an $O(\log n)$ factor overhead. For details of
these reductions we refer the reader to the original paper. In our case, the same approach also
allows us to directly convert any cell-probe algorithm from an offline to online setting. An offline
cell-probe algorithm for Hamming distance could first read the whole input, then compute the
answers and finally output them. This takes $O(\frac{\delta}{w} n)$ cell probes. We can therefore derive an
online cell-probe algorithm which takes only $O(\frac{\delta}{w} \log n)$ probes per output, matching the new
lower bound we give. We state the final result in Corollary 2.

Corollary 2. The expected amortised cell-probe complexity of the online Hamming distance
problem is $\Theta(\frac{\delta}{w} \log n)$.

One consequence of our results is the first strict separation between the complexity of exact
and inexact pattern matching. Online exact matching can be solved in constant time [Gal81]
per new input symbol and our new lower bound proves for the first time that this is not possible
for Hamming distance. Previous results had only shown such a separation for algorithms which
made use of fast convolution computation [CJ11]. Our new lower bound immediately opens the
interesting and as yet unresolved question of how to show the same separation for other distance
measures such as, for example, edit distance. The edit distance between two strings is widely 
assumed to be harder to compute than the Hamming distance but this has yet to be proven. 

Our lower bound also implies a matching lower bound for any problem that Hamming distance
can be reduced to. The most straightforward of these is online $L_1$ distance computation,
where the task is to output the $L_1$ distance between a fixed vector of integers and the last n
numbers in the stream. A suitable reduction was shown in [LP08]. The expected amortised cell-probe
complexity for the online $L_1$ distance problem is therefore also $\Theta(\frac{\delta}{w} \log n)$ per new output.



1.1 Technical contributions 

The use of information transfer to provide time lower bounds is not new, originating from [PD04].
However, applying the method to our problem has required a number of new insights and
technical innovations. Perhaps the most surprising of these is a new relationship between the 
Hamming distance, vector sums and constant weight binary cyclic codes. 

When computing the Hamming distance there is a balance between the number of symbols
being used and the length of the strings. For large alphabets and short strings, one would 
expect a typical outputted Hamming distance to be close to the length of the string on random 
inputs and therefore to provide very little information. This suggests that the length of the 
string must be sufficiently long to ensure that the entropy of the outputs is large (a property 
required by the information transfer method). On a closer look, it is not immediately obvious 
that large entropy can be obtained unless the fixed string that is being compared to the input 
stream is exponentially larger than the alphabet size. This potentially poses another problem 
for the information transfer method, namely that the entropy could end up being too small in
relation to the number of inputs over which the method operates.

Our main technical contribution is to show that fixed strings of length only polynomial in 
the size of the alphabet exist which provide sufficiently high entropy outputs. Such strings, 
when combined with a suitable input distribution maximising the number of distinct Hamming 
distance output arrays, give us the overall lower bound. We design a fixed string F with this
desirable property in such a way that there is a one-to-one mapping between many of the
different possible input streams and the outputted Hamming distances. This in turn implies
large entropy. The construction of F is non-trivial and we break it into smaller building blocks,
reducing our problem to a purely combinatorial question relating to vector sums. That is, given
a relatively small set V of vectors of length m, how many distinct vector sums can be obtained
by choosing m vectors from V and adding them (element-wise)? We show that even if we are
restricted to pick vectors only from subsets of V, there exists a V such that the number of
distinct vector sums is $m^{\Omega(m)}$. We believe this result is interesting in its own right. Our proof
for the combinatorial problem is non-constructive and probabilistic, using constant weight cyclic 
binary codes to prove that there is a positive probability of the existence of a set V with the 
desired property. 
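
For intuition, the combinatorial question can be explored by brute force on very small instances. The sketch below is illustrative only: the function name count_distinct_sums is hypothetical, and it assumes that the m vectors are chosen at distinct positions of V (so repeated vectors in a multi-set count separately), which is one possible reading of the question rather than the construction used in the proof.

```python
from itertools import combinations


def count_distinct_sums(V, m):
    """Count distinct element-wise sums over all choices of m vectors from V.

    V is a list of equal-length tuples of integers. Vectors are chosen at
    distinct positions of the list (an assumption of this sketch).
    """
    sums = set()
    for chosen in combinations(V, m):
        # zip(*chosen) walks the coordinates; sum each coordinate separately.
        sums.add(tuple(sum(column) for column in zip(*chosen)))
    return len(sums)
```

The running time is exponential in m, so this is only practical for toy inputs; the result in the text concerns the existence of a good V, not a way to find one.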



1.2 The cell-probe model 



Our bounds hold in a particularly strong computational model, the cell-probe model, introduced
originally by Minsky and Papert [MP69] in a different context and then subsequently by Fredman
[Fre78] and Yao [Yao81]. In this model, there is a separation between the computing unit
and the memory, which is external and consists of a set of cells of w bits each. The computing
unit cannot remember any information between operations. Computation is free and the cost is
measured only in the number of cell reads or writes (cell-probes). This general view makes the
model very strong, subsuming for instance the popular word-RAM model. In the word-RAM
model certain operations on words, such as addition, subtraction and possibly multiplication
take constant time (see for example [Hag98] for a detailed introduction). Here a word corresponds
to a cell. As is typical, we will require that the cell size w is at least $\log_2 n$ bits. This
allows each cell to hold the address of any location in memory.

The generality of the cell-probe model makes it particularly attractive for establishing lower
bounds for dynamic data structure problems and many such results have been given in the past
couple of decades. The approaches taken had historically been based only on communication
complexity arguments and the chronogram technique of Fredman and Saks [FS89]. However
in 2004, a breakthrough led by Pătraşcu and Demaine gave us the tools to seal the gaps for
several data structure problems [PD06] as well as giving the first $\Omega(\log n)$ lower bounds. The
new technique is based on information theoretic arguments that we also deploy here. Pătraşcu
and Demaine also presented ideas which allowed them to express more refined lower bounds such
as trade-offs between updates and queries of dynamic data structures. For a list of data structure
problems and their lower bounds using these and related techniques, see for example [Pat08].
Very recently, a new lower bound of $\Omega((\log n / \log\log n)^2)$ was given for the cell-probe complexity
of performing queries in the dynamic range reporting problem [Lar12]. This result holds under
the natural assumptions of $\Theta(\log n)$ size words and polylogarithmic time updates and is another
exciting breakthrough in the field of cell-probe complexity.



1.3 Barriers to improving our bounds 

The cell probe bound we give is tight within the model but still distant from the time complexity
of the fastest known RAM algorithms. For the online Hamming distance problem, the best
known complexity is $O(\sqrt{n \log n})$ time per arriving symbol [CEPP11]. It is therefore tempting
to wonder if better upper or lower bounds can be found by some other not yet discovered
method. This however appears challenging for at least two reasons. First, a higher lower
bound than $\Omega(\log n)$ immediately implies a superlinear offline lower bound for Hamming distance
computation by the online to offline reduction of [CEPP11]. This would be a truly remarkable
breakthrough in the field of computational complexity as no such offline lower bound is known
even for the canonical NP-complete problem SAT. On the other hand, an improvement of the
upper bound for Hamming distance computation to meet our lower bound would also have
significant implications. A reduction that is now regarded as folklore (see appendix) tells us
that any $O(f(n))$ algorithm for Hamming distance computation, assuming a pattern of length n
and a text of length 2n, implies an $O(f(n)^2)$ algorithm for multiplying square binary matrices
over the integers. Therefore an $O(\log n)$ time online Hamming distance algorithm would imply
an $O(n \log n)$ offline Hamming distance algorithm, which would in turn imply an $O(n^2 \log^2 n)$
time algorithm for binary matrix multiplication. Although such a result would arguably be less
shocking than a proof of a superlinear lower bound for Hamming distance computation, it would
be a significant breakthrough in the complexity of a classic and much studied problem.



1.4 Previous results for exact Hamming distance computation 

Almost all previous algorithmic work for exact Hamming distance computation has considered
the problem in an offline setting. Given a pattern P and a text T, the best current deterministic
upper bound for offline Hamming distance computation is an $O(|T|\sqrt{|P| \log |P|})$ time algorithm
based on convolutions [Abr87, Kos87]. In [Kar93] a randomised algorithm was given that takes
$O((|T|/\epsilon^2) \log^2 |P|)$ time which was subsequently modified in [Ind98] to $O((|T|/\epsilon^3) \log |P|)$. Particular
interest has also been paid to a bounded version of this problem called the k-mismatch
problem. Here a bound k is given and we need only report the Hamming distance if it is less
than or equal to k. In [LV86], an $O(|T|k)$ algorithm was given that is not convolution based and
uses $O(1)$ time lowest common ancestor (LCA) operations on the suffix tree of P and T. This
was then improved to $O(|T|\sqrt{k \log k})$ time by a method that combines LCA queries, filtering
and convolutions [ALP04].

2 Proof overview



Let us first introduce some basic notation which we will use throughout. For a positive integer
n, we define $[n] = \{0, \ldots, n-1\}$. For a string S of length n and $i, j \in [n]$, we write $S[i]$ to
denote the symbol at position i, and where $j \ge i$, $S[i,j]$ denotes the $(j-i+1)$-length substring
of S starting at position i. The string $S_1 S_2$ denotes the concatenation of strings $S_1$ and $S_2$. We
say that S is over the alphabet $\Sigma$ if $S[i] \in \Sigma$ for all $i \in [n]$. The Hamming distance between two
strings S and $S'$ of the same length n, denoted $\mathrm{Ham}(S, S')$, is the number of positions $i \in [n]$
for which $S[i] \ne S'[i]$.

The online Hamming distance problem is parameterised by two positive integers n and $\delta$ and
a string $F \in \Sigma^n$ where $|\Sigma| \le 2^\delta$. The parameter $\delta$ therefore denotes the smallest possible
number of bits needed to represent a symbol in the alphabet $\Sigma$. The problem is to maintain
a string $S \in \Sigma^n$ subject to an operation arrive(x) which takes a symbol $x \in \Sigma$, modifies S by
removing the leftmost symbol $S[0]$ and appending x to the right of the rightmost symbol $S[n-1]$,
and then returns the Hamming distance $\mathrm{Ham}(F, S)$ between F and the updated S. We refer to
the operation arrive(x) as the arrival of symbol x.
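
To make the semantics of arrive(x) concrete, the following minimal sketch implements the operation naively, recomputing the distance from scratch on every arrival. It is a reference for the definition only and costs time linear in n per symbol, far from the bounds discussed in this paper. The class name NaiveOnlineHamming and the explicit initial contents of S are assumptions made for illustration; the text does not specify how S is initialised.

```python
from collections import deque


class NaiveOnlineHamming:
    """Naive sliding-window solution: linear work per arriving symbol."""

    def __init__(self, fixed, initial):
        if len(fixed) != len(initial):
            raise ValueError("F and the initial S must have the same length n")
        self.fixed = list(fixed)
        # A deque with maxlen=n drops its leftmost element on each append.
        self.window = deque(initial, maxlen=len(fixed))

    def arrive(self, x):
        """Shift x into S and return Ham(F, S) for the updated S."""
        self.window.append(x)
        return sum(1 for a, b in zip(self.fixed, self.window) if a != b)
```

For example, with F = "abc" and S initially "aaa", the arrivals b, c, a produce the windows "aab", "abc" and "bca" in turn.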

In order to prove Theorem 1 we will consider a carefully chosen string F with a random
sequence of n arriving symbols and show that the expected running time over these arrivals is
$\Omega(\frac{\delta}{w} n \log n)$. We let the n-length string $U \in \Sigma^n$ contain the n arriving symbols of the update
sequence and we use $t \in [n]$ to denote the time, where the operation arrive($U[t]$) is said to occur
at time t. When referring to updates of U that take place outside some time interval $[t_0, t_1]$, we
use the notation $U[t_0,t_1]^c$ to denote the sequence $U[0] \cdots U[t_0-1]\,U[t_1+1] \cdots U[n-1]$. The
choice of F and distribution of updates U, which we defer to Section 3, is the most challenging
aspect of this work.

We let the n-length array $D \in [n+1]^n$ denote the Hamming distances outputted during the
update sequence U such that, for $t \in [n]$, $D[t] = \mathrm{Ham}(F, S)$, where S has just been updated by
the arrival of $U[t]$.



Following the overall approach of Demaine and Pătraşcu [PD04] we will consider adjacent
time intervals and study the information that is transferred from the operations in one interval
to the next. Let $t_0, t_1, t_2 \in [n]$ such that $t_0 \le t_1 < t_2$ and consider any algorithm solving the
online Hamming distance problem. We define the information transfer, denoted $IT(t_0, t_1, t_2)$,
to be the set of memory cells c such that c is written during the first interval $[t_0, t_1]$, read at
some time t in the subsequent interval $[t_1+1, t_2]$ and not written during $[t_1+1, t]$. Hence
a cell that is overwritten in the second interval before being read is not in the information
transfer. The cells of the information transfer contain all the information about the arriving
symbols in the first interval that the algorithm uses in order to correctly output the Hamming
distances $D[t_1+1, t_2]$ during the second interval. This fact is captured in the following lemma,
which was stated with small notational differences as Lemma 3.2 in [Pat08]. For completeness we
include a full proof. The overall idea of the proof is to describe an encoding of the information
transfer such that any algorithm running on the n arrivals in U, where the symbols outside
the first interval, $U[t_0,t_1]^c$, are fixed to some known values $U_{\mathrm{fix}}[t_0,t_1]^c$, can correctly output the
Hamming distances $D[t_1+1, t_2]$ by using the values $U_{\mathrm{fix}}[t_0,t_1]^c$ and decoding the information
transfer.
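
The definition of the information transfer can be made concrete by computing it from a recorded sequence of cell probes. The sketch below is only an illustration under stated assumptions: the trace format (a list of (time, op, address) tuples in execution order, with op equal to "r" or "w") and the function name information_transfer are hypothetical, and the order of probes within a single time step is taken to be their order in the trace.

```python
def information_transfer(trace, t0, t1, t2):
    """Return the set of cell addresses in IT(t0, t1, t2).

    trace: list of (time, op, address) tuples in execution order, where op
    is "r" for a read and "w" for a write (a hypothetical format).
    """
    written_in_first = set()  # cells written during [t0, t1]
    overwritten = set()       # cells written so far during [t1 + 1, t2]
    transfer = set()
    for time, op, address in trace:
        if t0 <= time <= t1:
            if op == "w":
                written_in_first.add(address)
        elif t1 < time <= t2:
            if op == "w":
                overwritten.add(address)
            elif address in written_in_first and address not in overwritten:
                # Read in the second interval before any overwrite there.
                transfer.add(address)
    return transfer
```

A cell that is overwritten in the second interval before its first read there never enters the returned set, matching the remark in the text.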



Lemma 3 (Lemma 3.2 of [Pat08]). The entropy

$$H(D[t_1+1, t_2] \mid U[t_0,t_1]^c = U_{\mathrm{fix}}[t_0,t_1]^c) \le w + 2w \cdot \mathbb{E}[\,|IT(t_0,t_1,t_2)| \mid U[t_0,t_1]^c = U_{\mathrm{fix}}[t_0,t_1]^c\,].$$

Proof. The average length of any encoding of $D[t_1+1, t_2]$ (conditioned on $U_{\mathrm{fix}}[t_0,t_1]^c$) is an upper
bound on its entropy. We use the information transfer as an encoding in the following way. For
every cell c in the information transfer $IT(t_0,t_1,t_2)$, we store the address of c, which takes at
most w bits under the assumption that the cell size can hold the address of every cell, and we
store the contents of c, which is a cell of w bits. In total this requires $2w \cdot |IT(t_0,t_1,t_2)|$ bits
which are stored consecutively as an array of cells in the memory. In addition we store the size of
the information transfer, $|IT(t_0,t_1,t_2)|$, so that any algorithm decoding the stored information
knows where the end of the array is. Storing the size of the information transfer requires w bits,
thus the average length of the encoding is $w + 2w \cdot \mathbb{E}[\,|IT(t_0,t_1,t_2)| \mid U[t_0,t_1]^c = U_{\mathrm{fix}}[t_0,t_1]^c\,]$.

In order to prove that the described encoding is valid, we describe how to decode the stored
information. We do this by simulating the algorithm. First we simulate the algorithm from
time 0 to $t_0 - 1$. We have no problem doing so since all necessary information is available
in $U_{\mathrm{fix}}[t_0,t_1]^c$, which we know. We then skip from time $t_0$ to $t_1$ and resume simulating the
algorithm from time $t_1+1$ to $t_2$. In this interval the algorithm outputs the Hamming distances
in $D[t_1+1, t_2]$. In order to correctly do so, the algorithm might need information about the
symbols that arrived during the interval $[t_0,t_1]$. This information is only available through the
encoding described above. When simulating the algorithm, for each cell c we read, we check
if the address of c is contained in the list of addresses that was stored. If so, we obtain the
contents of c by reading its stored value. Each time we write to a cell whose address is in the list
of stored addresses, we remove it from the stored list, or blank it out. Note that every cell we
read whose address is not in the stored list contains a value that was written last either before
time $t_0$ or after time $t_1$. Hence its value is known to us. □

While an encoding of the information transfer provides an upper bound on the entropy of
the outputs in the interval $[t_1+1, t_2]$, the question is how much information about the symbols
arriving in $[t_0,t_1]$ needs to be communicated from $[t_0,t_1]$ to $[t_1+1,t_2]$. We answer this question
in the next lemma by providing a lower bound on the entropy. The lemma is key to this paper
and its proof is given in Section 3 where we show that there is a string F such that for a large
set of updates U, the outputs $D[t_1+1,t_2]$ uniquely specify a constant fraction of the symbols
that arrived in $[t_0,t_1]$.

Lemma 4. There exists a string F and distribution of updates U such that for any two intervals
$[t_0,t_1]$ and $[t_1+1,t_2]$ of the same length $2^\ell \ge k\sqrt{n}$, where $k > 0$ is a constant, the entropy

$$H(D[t_1+1,t_2] \mid U[t_0,t_1]^c = U_{\mathrm{fix}}[t_0,t_1]^c) \in \Omega(\delta \cdot 2^\ell).$$

We combine Lemmas 3 and 4 in the following corollary.

Corollary 5. There exists a string F and distribution of U such that for any two intervals
$[t_0,t_1]$ and $[t_1+1,t_2]$ of the same length $2^\ell \ge k\sqrt{n}$, where $k > 0$ is a constant, and any algorithm
solving the online Hamming distance problem,

$$\mathbb{E}[\,|IT(t_0,t_1,t_2)|\,] \in \Omega\!\left(\frac{\delta}{w} \cdot 2^\ell\right).$$
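Proof. For $U[t_0,t_1]^c$ fixed to $U_{\mathrm{fix}}[t_0,t_1]^c$, by comparing Lemmas 3 and 4 we see that

$$\mathbb{E}[\,|IT(t_0,t_1,t_2)| \mid U[t_0,t_1]^c = U_{\mathrm{fix}}[t_0,t_1]^c\,] \ge \frac{c \cdot \delta \cdot 2^\ell}{2w} - \frac{1}{2},$$

where $c > 0$ is the constant hidden in the $\Omega$ of Lemma 4. The result follows by taking expectation over $U[t_0,t_1]^c$ under the random sequence U. □

We are now in a position to prove Theorem 1.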

Proof of Theorem 1. The main idea is to sum the information transfer between many pairs of
time intervals and show that over the n arrivals in U, a large amount of information must have
been transferred. To capture this idea, we conceptually think of a balanced binary tree over the
time axis, where the leaves, from left to right, represent the time t from 0 to $n-1$, respectively.
An internal node v is associated with the times $t_0$, $t_1$ and $t_2$ such that the two intervals $[t_0,t_1]$
and $[t_1+1,t_2]$ span the left subtree and the right subtree of v, respectively. The information
transfer $IT(v)$ associated with v is $IT(t_0,t_1,t_2)$. This idea was introduced in [PD04] in the
context of showing a lower bound for the partial sums problem. Here we use a refined version of
the technique which we need in order to cope with short intervals; the lower bound on the size
of the information transfer is by Corollary 5 only guaranteed for sufficiently large intervals.

A crucial property of the information transfer method which was proven in [PD04] is that the
cell probes counted in some information transfer $IT(v)$ associated with a node v of the tree are
not counted in $IT(v')$ of any other node $v'$. Therefore, using linearity of expectation, we have
that the sum over all nodes, $\sum_v \mathbb{E}[|IT(v)|]$, is a lower bound on the expected number of cell probes
over n updates. However, the lower bound on $\mathbb{E}[|IT(v)|]$ given by Corollary 5 is only guaranteed
for nodes representing sufficiently large intervals. Fortunately, this includes all nodes in the top
$\log \sqrt{n} - O(1)$ levels of the tree. By summing the information transfer over these nodes we have
that any algorithm performs $\Omega(\frac{\delta}{w} n \log n)$ expected cell probes over n updates. This concludes
the proof of Theorem 1. The remainder of the paper concerns the proof of Lemma 4 upon which
the main result relies.
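
The summation over the tree can be written out explicitly. As a sketch, assume that n is a power of two and write $c > 0$ for the constant hidden in the $\Omega$ of Corollary 5 (both are assumptions made here for illustration). There are $n/2^{\ell+1}$ nodes whose two subtrees each span $2^\ell$ leaves, and Corollary 5 applies to them whenever $2^\ell \ge k\sqrt{n}$:

```latex
\[
\sum_{\ell = \lceil \log_2 (k\sqrt{n}) \rceil}^{\log_2 n - 1}
   \frac{n}{2^{\ell+1}} \cdot c \, \frac{\delta}{w} \, 2^{\ell}
 = \frac{c}{2} \cdot \frac{\delta}{w} \, n
   \left( \log_2 n - \lceil \log_2 (k\sqrt{n}) \rceil \right)
 = \frac{c}{2} \cdot \frac{\delta}{w} \, n
   \left( \tfrac{1}{2} \log_2 n - O(1) \right)
 \in \Omega\!\left( \frac{\delta}{w} \, n \log n \right).
\]
```

Each level contributes the same amount, so the bound is the per-level contribution multiplied by the number of levels to which Corollary 5 applies.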

Although we have only shown the existence of probability distributions on the inputs for
which we can prove lower bounds on the expected running time of any deterministic algorithm,
by Yao's minimax principle [Yao77] this also immediately implies that for every (randomised)
algorithm, there is a worst-case input such that the (expected) running time is equally high.
Therefore our lower bounds hold equally for randomised algorithms as for deterministic ones. □



3 The hard instance 

In this section we discuss the proof of Lemma 4, the lower bound on the conditional entropy of
$D[t_1+1,t_2]$. We will describe a string F with the property that the outputs $D[t_1+1,t_2]$ during
the interval $[t_1+1,t_2]$ determine the values of a constant fraction of the symbols in $U[t_0,t_1]$.
By picking the update sequence U from a large set of strings, we ensure that the entropy of
$D[t_1+1,t_2]$ is large.

The description of the hard instance, i.e. the string F, is given in two parts. In this section
we use a certain string R to construct F. A full description of R itself is given separately in
Section 4 and is where the non-constructive part of the proof lies. For the purpose of constructing
F, we need R to have a particular property, which is stated in Lemma 6 below. The reason why
Lemma 6 is important will be clear shortly. First we introduce some notation.

For a string $S_1$ of length m and a string $S_2$ of length 2m, we write $\mathrm{HamArray}(S_1, S_2)$ to denote
the array of length $m+1$ such that, for $i \in [m+1]$, $\mathrm{HamArray}(S_1, S_2)[i] = \mathrm{Ham}(S_1, S_2[i, i+m-1])$.
That is, $\mathrm{HamArray}(S_1, S_2)$ is the array of Hamming distances between $S_1$ and every
m-length substring of $S_2$.
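
The definition of HamArray translates directly into code. The following is a minimal sketch that follows the definition literally; the function names ham and ham_array are chosen here for illustration.

```python
def ham(s1, s2):
    """Hamming distance between two sequences of equal length."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(s1, s2) if a != b)


def ham_array(s1, s2):
    """Distances between s1 (length m) and every m-length substring of s2."""
    m = len(s1)
    if len(s2) != 2 * m:
        raise ValueError("s2 must be exactly twice as long as s1")
    # Alignments i = 0, ..., m give m + 1 substrings s2[i .. i + m - 1].
    return [ham(s1, s2[i:i + m]) for i in range(m + 1)]
```

This direct computation performs on the order of $m^2$ symbol comparisons and is intended only to pin down the indexing.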

Lemma 6. For any r there exists a string $R \in [r]^r$ such that

$$\log |\{\mathrm{HamArray}(R, U') \mid U' \in [r]^{2r}\}| \in \Omega(r \log r).$$

The string R is partitioned into many smaller substrings containing distinct symbols. By 
choosing the substrings at random, we show that there is a positive probability of getting a string 
R with the desired property and hence such an R exists. The proof of Lemma 6 will demonstrate
an interesting connection between Hamming distances, vector sums and cyclic codes. 

For the construction of F we set $r = 2^\delta - 1$. Recall that $\delta$ is the number of bits needed to
represent a symbol of the alphabet. The minus one ensures that we can reserve one symbol that
does not appear in the alphabet $[r]$ over which R is defined. We let * denote this symbol.

From now on, let R be a string with the property of Lemma 6. We define $\mathcal{U} \subseteq [r]^{2r}$ to
be a largest set of strings such that for any two distinct $U'_1, U'_2 \in \mathcal{U}$, $\mathrm{HamArray}(R, U'_1) \ne
\mathrm{HamArray}(R, U'_2)$. The set $\mathcal{U}$ is not unique and is chosen arbitrarily as long as its size is
maximised. By Lemma 6 we have that the size of $\mathcal{U}$ is at least $r^{cr}$ for some constant c. We will
see that setting the update sequence U to consist of strings chosen randomly from $\mathcal{U}$ yields a
sufficiently rich variety of outputted Hamming distances $D[t_1+1,t_2]$ to make its entropy large.

3.1 The fixed string F 

For the construction of F to work we need the length n of F to be sufficiently large in comparison
to r. To avoid unnecessary technicalities, think of n as being at least $r^2$. Observe that with the
cell size $w = \log n$ and n being any polynomial in r (i.e. $\delta = \Theta(\log n)$), our lower bound for the
online Hamming distance problem simplifies to $\Omega(\log n)$.

Perhaps the easiest way to describe F is to start with an n-length string that consists entirely
of the symbol *, i.e. the string $\{*\}^n$, and then replace r-length substrings $\{*\}^r$ with the string R
as follows. For each suffix of $\{*\}^n$ that has a power-of-two length and is longer than r, replace
its r-length prefix with the string R. There will therefore be a logarithmic number of copies of
R in F, and they all start at power-of-two positions from the end of F.
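
The construction of F from R can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name build_fixed_string is hypothetical, positions are indexed from the left starting at 0, and the string "*" stands in for the reserved symbol, so R is assumed not to contain it.

```python
STAR = "*"  # placeholder for the reserved symbol, assumed absent from R


def build_fixed_string(n, R):
    """Place a copy of R at the start of every power-of-two-length suffix
    of an all-STAR string of length n, for suffixes longer than len(R)."""
    r = len(R)
    F = [STAR] * n
    length = 1
    while length <= n:
        if length > r:
            start = n - length  # the suffix of this length begins here
            F[start:start + r] = R
        length *= 2
    return F
```

For n = 16 and R = "abc" the suffixes of lengths 4, 8 and 16 each receive a copy of R, which gives the string abc*****abc*abc*. Consecutive copies cannot overlap because each qualifying suffix is at least twice as long as the previous one.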

The benefit of this construction is seen with the following reasoning, which outlines the proof
of Lemma 4. Randomly pick an n-length string W that is the concatenation of $n/(2r)$ strings of
length 2r chosen independently and uniformly at random from $\mathcal{U}$. Consider an interval $[t_1+1,t_2]$
that has length $2^\ell$ (which we assume is at least a constant times r) and let U be the update
sequence such that the updates during $[t_0,t_1]$ are induced by W (i.e. $U[t_0,t_1] = W[t_0,t_1]$)
whereas the updates $U[t_0,t_1]^c$ are fixed to some arbitrary $U_{\mathrm{fix}}[t_0,t_1]^c$ over the alphabet $[r]$.

The contribution to the outputs of $D[t_1+1,t_2]$ can be split into two parts: those coming
from mismatches with $U[t_0,t_1]$ and those coming from mismatches with $U[t_0,t_1]^c$. Since the
latter is known to us, we can derive from $D[t_1+1,t_2]$ the contribution from the former. From
the construction of F, it is not too difficult to see that for most of the second half of the interval
$[t_1+1,t_2]$, all the unknown updates of $U[t_0,t_1]$ are aligned with either * (causing a mismatch)
or the occurrence of R at the start of the $2^\ell$-length suffix of F. Around half of the 2r-length
substrings of $U[t_0,t_1]$ drawn from $\mathcal{U}$ will in turn slide over this copy of R while every other symbol
of $U[t_0,t_1]$ is aligned with *. Each such substring $U' \in \mathcal{U}$ contributes $\mathrm{HamArray}(R, U')$
to the outputs and, as we reasoned above, $\mathrm{HamArray}(R, U')$ can therefore be derived from the
corresponding substring of $D[t_1+1,t_2]$. By definition there is only one string $U' \in \mathcal{U}$ that can
give rise to $\mathrm{HamArray}(R, U')$, hence we can uniquely identify which string $U'$ of $\mathcal{U}$ was chosen.

In total, around half of the substrings from $\mathcal{U}$ in $U[t_0,t_1]$ are uniquely identified through
the outputs $D[t_1+1,t_2]$. More precisely, $\Omega(2^\ell/r)$ substrings are uniquely identified. As the
substrings were chosen uniformly at random from $\mathcal{U}$, we have by Lemma 6 that the entropy of
$D[t_1+1,t_2]$ is $\Omega(2^\ell/r \cdot r \log r) = \Omega(\log r \cdot 2^\ell) = \Omega(\delta \cdot 2^\ell)$ since r was defined to be $2^\delta - 1$. This
property holds only for sufficiently large intervals, where our suggested choice of n being at least
$r^2$ comfortably makes the property true for intervals of length at least some constant times $\sqrt{n}$.

To sum up, we have described a string F and a set $\mathcal{U}$ such that if the update sequence U
is picked by concatenating strings chosen independently and uniformly at random from $\mathcal{U}$, we
have the lower bound on the conditional entropy of $D[t_1+1,t_2]$ of Lemma 4.

4 A string with many different Hamming arrays 

Our remaining task is now to show that a string R does indeed exist which gives many different
Hamming arrays, as demanded by Lemma 6. This is both the most important and
the most technically detailed part of our overall lower bound proof. To recap, we claim that
for any r there exists a string $R \in [r]^r$ which permits a large number of distinct Hamming
arrays when compared to every string in $[r]^{2r}$. Precisely, that there exists a string R with
$\log |\{\mathrm{HamArray}(R, U') \mid U' \in [r]^{2r}\}| \in \Omega(r \log r)$.

Figure 1: Setting symbols of U' renders a large set of possible Hamming distance outputs.



4.1 The structure of R 

The string $R$ is constructed by concatenating $\mu^2 = \Theta(r^{2/3})$ substrings each of length $\mu = \Theta(r^{1/3})$,
containing exactly two symbols. One of the symbols will be common to all substrings and
the other unique to that substring. Denote the $i$-th such substring by $\rho_i$ and hence $R =
\rho_0 \rho_1 \cdots \rho_{(\mu^2 - 1)}$. Each substring $\rho_i$ of $R$ will correspond to a binary vector $v_i$ in the following
natural way. Let $V = \{v_0, \ldots, v_{\mu^2 - 1}\}$ be a multi-set of $\mu$-length vectors from $\{0, 1\}^{\mu}$ which
will have the property set out in Lemma 7 below. The string $\rho_i \in \{\star, i\}^{\mu}$ is then given by taking
$v_i$ and replacing every occurrence of 1 with an occurrence of the symbol $i$ and similarly every
0 with the symbol $\star$ (formally $\star = \mu^2 \in [r]$). For example, if $\mu = 3$, $v_2 = (0, 1, 0)$ and $v_7 = (1, 1, 0)$
then $\rho_2 = \star 2 \star$ and $\rho_7 = 7 7 \star$. Observe that $\rho_i$ contains only two symbols and the symbol $i$
occurs only in $\rho_i$. We will assume wlog that $r$ is a perfect cube, $r = \mu^3$, and that $\mu - 1$ is a prime,
which we will also require below. The result generalises to arbitrary $r$ via a simple reduction to
a smaller $r$ which meets the assumptions.
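
The encoding of the vectors into the substrings $\rho_i$ can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `build_R`, the representation of strings as lists of integers, and the all-zero filler vectors are not from the text. Only the rule itself (1 becomes the symbol $i$, 0 becomes $\star = \mu^2$) is.

```python
def build_R(V):
    """Concatenate rho_0 ... rho_{mu^2-1}, where rho_i encodes the 0/1 vector V[i].

    Assumes len(V) == mu*mu and that every vector has length mu (checked below).
    """
    mu = len(V[0])
    assert len(V) == mu * mu and all(len(v) == mu for v in V)
    star = mu * mu  # the symbol shared by all substrings, formally mu^2
    R = []
    for i, v in enumerate(V):
        # 1 -> symbol i (unique to rho_i), 0 -> the shared symbol star
        R.extend(i if bit == 1 else star for bit in v)
    return R


# The example from the text: mu = 3, v_2 = (0, 1, 0), v_7 = (1, 1, 0).
mu = 3
V = [(0, 0, 0)] * (mu * mu)   # filler vectors, chosen arbitrarily for the sketch
V[2] = (0, 1, 0)
V[7] = (1, 1, 0)
R = build_R(V)
rho_2 = R[2 * mu:3 * mu]      # [9, 2, 9], i.e. star 2 star
rho_7 = R[7 * mu:8 * mu]      # [7, 7, 9], i.e. 7 7 star
```

The resulting list has length $\mu^3 = r$, matching the assumption that $r$ is a perfect cube.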

The substrings $\rho_i$ of $R$ can be seen as encodings of binary vectors from a multi-set $V$ by the
method described above. We will also show that by suitably selecting the updates $U'$ from the
unique symbols in $R$, the Hamming distances that result will be element-wise sums of the vectors
from this multi-set. We will therefore have reduced the problem of finding a string giving a large
number of distinct Hamming arrays to that of finding multi-sets with a large number of distinct
vector sums. Lemma 7 formally captures the property that we require of $V$. The reason that the
property must hold for large multi-subsets of $V$ will become apparent from the construction
below. Intuitively it is because we will 'use up' vectors from $V$ as we proceed.

Lemma 7. For any $\mu > 40$ such that $\mu - 1$ is a prime, there exists a multi-set $V$ of vectors from
$\{0, 1\}^{\mu}$ such that $|V| = \mu^2$ and for any multi-subset $V' \subseteq V$ of size at least $(63/64)|V|$,

$|\{ w_1 + \cdots + w_{\mu} \mid w_1, \ldots, w_{\mu} \text{ are distinct elements from } V' \}| \geq \mu^{(\mu/10)}$.
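
For very small multi-sets, the quantity bounded in Lemma 7 can be computed by brute force. The Python sketch below does so. The function name `distinct_sums` and the toy inputs are our own, and "distinct elements" is read as distinct positions in the multi-set (so a vector that occurs twice may be chosen twice), which is an assumption on our part.

```python
from itertools import combinations


def distinct_sums(vectors, k):
    """Return the set of element-wise integer sums of k distinct members of `vectors`.

    `vectors` is a list (a multi-set): members are distinguished by index,
    so equal vectors at different indices count as different members.
    """
    length = len(vectors[0])
    sums = set()
    for chosen in combinations(range(len(vectors)), k):
        sums.add(tuple(sum(vectors[i][p] for i in chosen) for p in range(length)))
    return sums


# Toy inputs, far below the sizes required by the lemma.
toy = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]
toy_sums = distinct_sums(toy, 2)           # all six pairs give different sums
repeated = [(1, 0), (1, 0), (0, 1)]
repeated_sums = distinct_sums(repeated, 2)  # {(2, 0), (1, 1)}
```

The enumeration grows as $\binom{|V|}{\mu}$, so this is only usable for tiny examples and says nothing about the sizes for which the lemma is stated.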

4.2 Vector sums and Hamming arrays: the proof of Lemma 6

We can now prove Lemma 6. Our method is to show how to obtain a large number of Hamming
arrays from the string $R$ by incrementally modifying an initial string of length $2r$ which $R$ will be
compared to and which does not contain any symbol in $R$. Let us call such a string $U' = \{\circ\}^{2r}$,
and let $\circ$ be a special symbol which does not occur in $R$ (formally $\circ = \mu^2 + 1 \in [r]$).

Consider the first alignment of $R$ and $U'$ where $R[0]$ is aligned with $U'[0]$ (refer to Figure 1,
top). From the set $V$, pick any $\mu$ vectors and identify their corresponding substrings $\rho_i$ of $R$.
For each such substring, set the symbol of $U'$ that is directly to the right of $\rho_i$ in the alignment
to $i$. For example, in Figure 1 where we have explicitly written out $\rho_0$, $\rho_5$ and $\rho_7$, suppose that
$v_0$, $v_5$ and $v_7$ are among the $\mu$ vectors we picked. As shown in the figure, we set three symbols
of $U'$ to 0, 5 and 7, accordingly.

Now consider the first $\mu$ Hamming distances in $\mathrm{HamArray}(R, U')$. We can think of these as
being outputted as we slide the string $U'$, $\mu$ characters to the left (relative to $R$). The bottom
part of the figure illustrates the alignment after sliding $U'$. We see that the number of matches
at each one of the $\mu$ alignments corresponds exactly to the vector sum of the $\mu$ vectors we picked
(in reverse order, to be precise).

We now repeat this process by picking $\mu$ new vectors from $V$ and setting symbols of $U'$
accordingly. We cannot pick a vector for which the position is already occupied by a symbol
other than $\circ$. For the example in the figure, we would not be able to pick $v_4$ or $v_6$. If we did
so, we would risk changing the previous Hamming distances. This is in fact the reason that
Lemma 7 must hold not only for $V$ but also for large multi-subsets of $V$. The procedure of
picking $\mu$ vectors and sliding $U'$ by $\mu$ steps is repeated a total of $\mu/64$ times, over which a total
of $\mu^2/64$ symbols of $U'$ have been set. As $|V| = \mu^2$, we have in each one of the $\mu/64$ rounds had
access to pick at least $(63/64)|V|$ vectors. Thus, by Lemma 7, in each round we had a choice of
at least $\mu^{(\mu/10)}$ distinct Hamming distance outputs. For correctness it is important to observe
that setting a symbol of $U'$ to some value $i$ only makes this symbol contribute to matches over
exactly the $\mu$ alignments it is intended for. For all other alignments the symbol will mismatch.
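
The first round of this construction can be simulated directly. The Python sketch below is an illustration under our own assumptions: the helper names are hypothetical, strings are lists of integers, and the alignment convention is that after sliding $U'$ by $j$ steps to the left, $R[k]$ faces $U'[k + j]$. It counts matches rather than Hamming distances (the distance at an alignment is $r$ minus the number of matches).

```python
def build_R(V, mu):
    """Encode vector V[i] as rho_i: 1 -> symbol i, 0 -> shared symbol mu^2."""
    star = mu * mu
    R = []
    for i, v in enumerate(V):
        R.extend(i if bit == 1 else star for bit in v)
    return R


def matches_per_slide(R, U, mu):
    """Number of matching positions after sliding U by j = 1..mu steps to the left."""
    return [sum(1 for k in range(len(R)) if R[k] == U[k + j])
            for j in range(1, mu + 1)]


def first_round(V, picked, mu):
    """Set one symbol of U' per picked vector, then slide mu steps."""
    r = mu ** 3
    blank = mu * mu + 1               # the symbol that does not occur in R
    R = build_R(V, mu)
    U = [blank] * (2 * r)
    for i in picked:
        U[(i + 1) * mu] = i           # position directly to the right of rho_i
    return matches_per_slide(R, U, mu)


mu = 3
# Arbitrary toy multi-set: V[i] holds the three low-order bits of i.
V = [tuple((i >> b) & 1 for b in range(mu)) for i in range(mu * mu)]
picked = [1, 5, 7]
vector_sum = [sum(V[i][p] for i in picked) for p in range(mu)]   # [3, 1, 2]
matches = first_round(V, picked, mu)                             # [2, 1, 3]
```

In this toy run the match counts are the vector sum read in reverse order, as the argument above states.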

The process of performing $\mu/64$ rounds as above is itself repeated $\mu$ times. To see how this
is possible, apply the following trick: slide $U'$ by one single step. By doing this, we offset all
symbols, freeing up every occupied position of $U'$ so that we can perform the process above
again. The single-slide trick can be repeated $(\mu - 1)$ times, after which occupied positions will
no longer necessarily be freed up but instead reoccupied by symbols that were set during the
first rounds.

To sum up, we slide $U'$ a total of $(\mu \cdot (\mu/64) + 1) \cdot \mu \leq \mu^3/32 = r/32$ steps. Over these steps,
by Lemma 7 we have the choice of at least $(\mu^{(\mu/10)})^{(\mu/64) \cdot \mu} = \mu^{(\mu^3/640)} = \mu^{(r/640)}$ Hamming
array outputs. So we have $\log |\{ \mathrm{HamArray}(R, U') \mid U' \in [r]^{2r} \}| \geq (r/640) \log \mu \in \Omega(r\delta)$, since
$\mu = r^{1/3}$ and $r = 2^{\delta} - 1$. This completes the proof of Lemma 6.

4.3 Vector sets with many distinct sums: the proof of Lemma 7

In this section, we prove Lemma 7. We first rephrase it slightly for our purposes. For any $V' \subseteq
\{0, 1\}^{\mu}$, we define $\mathrm{Sum}(V') = \{ w_1 + \cdots + w_{\mu} \mid w_1, \ldots, w_{\mu} \text{ are distinct elements from } V' \}$. Here
vector addition is element-wise and over the integers. We will show that there exists a multi-set $V$
of vectors from $\{0, 1\}^{\mu}$ such that $|V| = \mu^2$ and for any multi-subset $V' \subseteq V$ of size at least
$(63/64)|V|$, we have that $|\mathrm{Sum}(V')| \geq \mu^{(\mu/10)}$.

Our approach will be an application of the probabilistic method. Specifically, we will show
that when $V$ is sampled uniformly at random, the expected value of $\min_{V'} |\mathrm{Sum}(V')|$ is at least
$\mu^{(\mu/10)}$ and hence there exists a $V$ with the required property. To prove this result, we will require the
following lemma from the field of Coding Theory, which is tailored for our needs and is a special
case of a result in [AGM92]. For our purposes, a binary constant-weight cyclic code
can be seen simply as a set of bit-strings (codewords) with two additional properties. The first is
that all codewords have constant Hamming weight $\mu$, i.e. they have exactly $\mu$ ones. The second
is that any cyclic shift of a codeword is also a codeword.

Lemma 8 ([AGM92]). For any $\mu \geq 4$ such that $\mu - 1$ is a prime and any odd $q \in [\mu]$, there is
a binary constant-weight cyclic code with $(\mu - 1)^q$ codewords of length $\mu(\mu - 1)$ and Hamming
weight $\mu$ such that any two codewords have Hamming distance at least $2(\mu - q)$.

We will show that Lemma 7 holds for $V$ of size $(\mu - 1)\mu < \mu^2$. We first consider a random
multi-set $V$ where the vectors are chosen independently and uniformly at random from $\{0, 1\}^{\mu}$.
Later in the proof we will fix the multi-set $V$ and show that it has the property set out in the
statement of the lemma.

A multi-subset of $V$ of size $\mu$ can be represented by a $|V|$-length bit string with Hamming
weight $\mu$, where a 1 at position $i$ means that the $i$-th vector of $V$ is in the multi-subset. Let $\mathcal{C}$
be the binary code that contains all codewords of length $|V|$ with Hamming weight $\mu$. That is,
$\mathcal{C}$ represents all $\mu$-sized multi-subsets of $V$. To shorten notation, we refer to $c \in \mathcal{C}$ as both a
codeword and a vector set.

We now let $C \subseteq \mathcal{C}$ be a smaller code such that $C$ is cyclic (i.e. $c[0]c[1] \cdots c[|V| - 1] \in C$ implies
that also $c[1]c[2] \cdots c[|V| - 1]c[0] \in C$) and the Hamming distance between any two codewords in
$C$ is at least $7\mu/4$. We choose $C$ such that its size is $(\mu - 1)^q$, a value between $(\mu - 1)^{\mu/9}$ and
$(\mu - 1)^{\mu/8}$, where $q$ is any odd integer in the interval $[\mu/9, \mu/8]$. The existence of such a $C$ is
guaranteed by Lemma 8. Like for $\mathcal{C}$, every codeword of $C$ has Hamming weight $\mu$.
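
The three properties asked of $C$ (closure under cyclic shifts, constant weight, and a lower bound on the pairwise Hamming distance) are easy to check mechanically on a small example. The Python sketch below uses a toy code of length 6 and weight 3 that we made up for illustration. It is not one of the codes whose existence Lemma 8 guarantees, and all function names are hypothetical.

```python
from itertools import combinations


def cyclic_closure(words):
    """Smallest set containing `words` that is closed under cyclic shifts."""
    code = set()
    for w in words:
        for s in range(len(w)):
            code.add(w[s:] + w[:s])
    return code


def is_cyclic(code):
    """True if shifting any codeword by one position gives another codeword."""
    return all(c[1:] + c[:1] in code for c in code)


def weights(code):
    """Set of Hamming weights occurring in the code (a singleton if constant-weight)."""
    return {sum(c) for c in code}


def min_distance(code):
    """Smallest Hamming distance between two distinct codewords."""
    return min(sum(a != b for a, b in zip(x, y)) for x, y in combinations(code, 2))


toy_code = cyclic_closure([(1, 1, 0, 1, 0, 0), (1, 0, 1, 0, 1, 0)])
# Eight codewords: six shifts of the first word, two of the second.
```

For this toy code the weight is 3 throughout and the minimum distance is 2, far from the $7\mu/4$ separation that the proof relies on.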

For $c \in C$, we define the ball, $\mathrm{Ball}(c) = \{ \tilde{c} \mid \tilde{c} \in \mathcal{C} \text{ and } \mathrm{Ham}(c, \tilde{c}) \leq \mu/16 \}$, to be the set of
bit strings in $\mathcal{C}$ of weight $\mu$ at Hamming distance at most $\mu/16$ from $c$. Hence, the $|C|$ balls are
all disjoint. We have that for any $c \in C$, using the fact $\binom{a}{b} \leq (ae/b)^b$,

For $c \in \mathcal{C}$, we write $\mathrm{Sum}(c)$ to denote the vector in $[\mu + 1]^{\mu}$ obtained by adding the $\mu$
vectors in the vector set $c$. For any $\tilde{c}_1 \in \mathrm{Ball}(c_1)$ and $\tilde{c}_2 \in \mathrm{Ball}(c_2)$, where $c_1, c_2 \in C$ are
distinct, we now ask what is the probability that $\mathrm{Sum}(\tilde{c}_1) = \mathrm{Sum}(\tilde{c}_2)$. From the definitions
above, it follows that $\tilde{c}_1$ and $\tilde{c}_2$ must differ on at least $7\mu/4 - 2(\mu/16) \geq \mu$ positions, implying
that the two vector sets $\tilde{c}_1$ and $\tilde{c}_2$ have at most $\mu/2$ vectors in common, thus at least $\mu/2$
of the vectors in $\tilde{c}_1$ are not in $\tilde{c}_2$. Let $v_1, \ldots, v_{\mu/2}$ denote those vectors. In order to have
$\mathrm{Sum}(\tilde{c}_1) = \mathrm{Sum}(\tilde{c}_2)$, for each position $i \in [\mu]$, the sum $S_i = v_1[i] + \cdots + v_{\mu/2}[i]$ must be some
specific value (that depends on the other vectors). Due to independence between vectors and
their uniform distribution (any element of any vector is 1 with probability 1/2), the most likely
value of $S_i$ is $\mu/4$, for which half of the vectors $v_1, \ldots, v_{\mu/2}$ have a 1 at position $i$. The probability
of having $S_i = \mu/4$ is exactly $\binom{\mu/2}{\mu/4} \cdot 2^{-\mu/2} \leq (\mu/2)^{-1/2}$, as for any $a$, $\binom{a}{a/2} \leq 2^a/\sqrt{a}$. Due
to independence between the elements of a vector, the probability that all sums $S_0, \ldots, S_{\mu - 1}$
combined yield $\mathrm{Sum}(\tilde{c}_1) = \mathrm{Sum}(\tilde{c}_2)$ is upper bounded by $(\mu/2)^{-\mu/2}$. Thus,

$\Pr(\mathrm{Sum}(\tilde{c}_1) = \mathrm{Sum}(\tilde{c}_2)) \leq (\mu/2)^{-\mu/2}$.    (1)
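
The central-binomial estimate used for Equation (1) can be checked numerically. The Python sketch below is our own illustration with hypothetical function names. It compares squared integers so that no floating-point rounding is involved, and it also computes the exact probability that a sum of fair bits hits its most likely value.

```python
from fractions import Fraction
from math import comb


def central_binomial_bound_holds(a):
    """Check C(a, a/2) <= 2^a / sqrt(a) for even a > 0, via C(a, a/2)^2 * a <= 4^a."""
    assert a > 0 and a % 2 == 0
    return comb(a, a // 2) ** 2 * a <= 4 ** a


def prob_sum_hits_middle(n):
    """Exact probability that n independent fair bits sum to n/2 (n even)."""
    assert n % 2 == 0
    return Fraction(comb(n, n // 2), 2 ** n)


bound_ok = all(central_binomial_bound_holds(a) for a in range(2, 400, 2))
# With mu = 8 there are mu/2 = 4 differing vectors per position:
p_middle = prob_sum_hits_middle(4)   # 6/16 = 3/8, below (mu/2)^(-1/2) = 1/2
```

The check covers only the finite range tested and is not a substitute for the inequality itself.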

For two distinct $c_1, c_2 \in C$, we define the indicator random variable $I(c_1, c_2)$ to be 0 if and
only if there exists a $\tilde{c}_1 \in \mathrm{Ball}(c_1)$ and a $\tilde{c}_2 \in \mathrm{Ball}(c_2)$ such that $\mathrm{Sum}(\tilde{c}_1) = \mathrm{Sum}(\tilde{c}_2)$. Taking
the union bound over all $\tilde{c}_1 \in \mathrm{Ball}(c_1)$ and $\tilde{c}_2 \in \mathrm{Ball}(c_2)$, and using the probability bound in
Equation (1), we have

$\Pr(I(c_1, c_2) = 0) \leq |\mathrm{Ball}(c_1)| \cdot |\mathrm{Ball}(c_2)| \cdot (\mu/2)^{-\mu/2} \leq \cdots$    (2)

For any $c_1 \in C$, we now define the indicator random variable $I'(c_1)$ to be 0 if and only if
there exists some $c_2 \in C \setminus \{c_1\}$ such that $I(c_1, c_2) = 0$. Taking the union bound over all $c_2 \in C$
and using Equation (2), we have




We say that $\mathrm{Ball}(c_1)$ is good iff $I'(c_1) = 1$. From the definitions above we have that for
every $\tilde{c}_1$ in a good ball, there is no other ball that contains a $\tilde{c}_2$ such that $\mathrm{Sum}(\tilde{c}_1) = \mathrm{Sum}(\tilde{c}_2)$.
It is possible that $\mathrm{Sum}(\tilde{c}_1) = \mathrm{Sum}(\tilde{c}_2)$ if $\tilde{c}_2$ is from the same ball as $\tilde{c}_1$ though. The expected
number of good balls is, by linearity of expectation and Equation (3), $\mathbb{E}[\sum_{c \in C} I'(c)] \geq |C|/2$.
The conclusion is that there is a multi-set $V$ of vectors for which at least $|C|/2$ balls are good,
hence $|\mathrm{Sum}(V)| \geq (\mu - 1)^{\mu/9}/2$.

From now on, we fix $V$ to be such a multi-set. It remains to
show that for any multi-subset $V'$ of $V$ of size $(63/64)|V|$, $\mathrm{Sum}(V')$ is also large.

Over all codewords in $C$, the total number of 1s is $|C|\mu$. As $C$ is cyclic, the number of
codewords that have a 1 in position $i \in [|V|]$ is the same as the number of codewords that have
a 1 in any position $j \neq i$. Thus, for each one of the $|V|$ positions there are exactly $|C|\mu/|V|$
codewords in $C$ with a 1 in that position.

Let $V'$ be any multi-subset of $V$ of size $(63/64)|V|$. Let $J$ be the set of $|V|/64$ positions
that correspond to the vectors of $V$ that are not in $V'$. For each $j \in J$ and codeword $c \in C$,
we set $c[j]$ to 0. The total number of 1s is therefore reduced by exactly $(|V|/64) \cdot (|C|\mu/|V|) =
|C|\mu/64$. The number of codewords of $C$ that have lost $\mu/16$ or more 1s is therefore at most
$(|C|\mu/64)/(\mu/16) = |C|/4$. Let $C' \subseteq C$ be the set of codewords $c$ that have lost less than $\mu/16$
1s and for which $\mathrm{Ball}(c)$ is good. As there are at least $|C|/2$ good balls, $|C'| \geq |C|/4$. Let the
code $C''$ be obtained from $C'$ by replacing, for each codeword $c' \in C'$, every removed 1 with a 1 at
some other arbitrary position that is not in $J$. Thus, every codeword of $C''$ has Hamming weight
$\mu$ and they all belong to $|C''| = |C'| \geq |C|/4$ distinct good balls. Further, every codeword of $C''$,
seen as a vector set, only contains vectors from the subset $V'$. From the definition of a good ball
we have that at least $|C|/4$ distinct vector sums can be obtained by adding $\mu$ vectors from $V'$.
Thus, $|\mathrm{Sum}(V')| \geq (\mu - 1)^{\mu/9}/4 \geq \mu^{(\mu/10)}$ when $\mu > 40$. This completes the proof of Lemma 7.

A Appendix



A.1 Folklore matrix multiplication reduction

We show a reduction from binary matrix multiplication to pattern matching under the Hamming 
distance. 

Consider the following reduction. Assume the input consists of two binary matrices A and B of sizes
m × ℓ and ℓ × n. For matrix A, we write x for each 0 and for each 1 we write its column number.
For example, A = ((0, 0, 1), (1, 0, 1)) is translated to A' = ((x, x, 3), (1, x, 3)). For matrix B,
we write y for each 0 and the row number for each 1. For example, B = ((0, 1), (1, 0), (0, 0)) is
translated to B' = ((y, 1), (2, y), (y, y)). Now create pattern P as the concatenation of the rows
of A' and text T as the concatenation of the columns of B' with the unique symbol $ inserted
after every column and add ℓ(m − 1) $ symbols at the beginning and end of T. So, in our
example P = xx31x3 and T = $$$y2y$1yy$$$$.

We now count the number of matches between P and T at each alignment, giving in this 
case 0, 0, 0, 0, 1, 0, 0, 0, 0, meaning that the second row of A scored 1 when multiplied with the 
second column of B. The trick is that the $ symbols force at most one substring of the pattern 
corresponding to a row in A to match one substring of T corresponding to a column of B at any 
given alignment. 
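
The reduction can be tried out on the example above with a short Python sketch. The function names and the index arithmetic for reading entries of the product back out of the match counts are our own. The sketch restricts itself to m ≤ ℓ + 1, a condition we add so that each alignment pairs at most one row with one column; it is an assumption of the sketch rather than a statement from the text.

```python
def encode(A, B):
    """Build pattern P from A and text T from B as lists of symbols."""
    m, l, n = len(A), len(B), len(B[0])
    P = []
    for row in A:
        # 0 -> "x", 1 -> its (1-based) column number
        P.extend(k + 1 if bit else "x" for k, bit in enumerate(row))
    pad = ["$"] * (l * (m - 1))
    T = list(pad)
    for j in range(n):
        # column j of B: 0 -> "y", 1 -> its (1-based) row number
        T.extend(k + 1 if B[k][j] else "y" for k in range(l))
        T.append("$")
    T.extend(pad)
    return P, T


def match_counts(P, T):
    """Number of matching positions between P and T at every alignment."""
    return [sum(p == t for p, t in zip(P, T[a:a + len(P)]))
            for a in range(len(T) - len(P) + 1)]


def product_from_matches(A, B):
    """Read the integer matrix product A*B off the match counts."""
    m, l, n = len(A), len(B), len(B[0])
    assert m <= l + 1, "sketch only handles m <= l + 1"
    counts = match_counts(*encode(A, B))
    # Row i of A faces column j of B at alignment l*(m-1) + j*(l+1) - i*l.
    return [[counts[l * (m - 1) + j * (l + 1) - i * l] for j in range(n)]
            for i in range(m)]


A = [[0, 0, 1], [1, 0, 1]]
B = [[0, 1], [1, 0], [0, 0]]
P, T = encode(A, B)
counts = match_counts(P, T)            # [0, 0, 0, 0, 1, 0, 0, 0, 0]
product = product_from_matches(A, B)   # [[0, 0], [0, 1]]
```

Integer symbols never compare equal to the string symbols "x", "y" and "$", which is what keeps the padding and the zero entries from contributing matches.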



This reduction is attributed to Ely Porat according to Raphael Clifford. Ely Porat attributes it to Piotr
Indyk. Piotr Indyk denies this.